The performance of FIT-based and other risk prediction models for colorectal neoplasia in symptomatic patients: a systematic review

Summary Background Colorectal cancer (CRC) incidence and mortality are increasing internationally. Endoscopy services are under significant pressure with many overwhelmed. Faecal immunochemical testing (FIT) has been advocated to identify a high-risk population of symptomatic patients requiring definitive investigation by colonoscopy. Combining FIT with other factors in a risk prediction model could further improve performance in identifying those requiring investigation most urgently. We systematically reviewed performance of models predicting risk of CRC and/or advanced colorectal polyps (ACP) in symptomatic patients, with a particular focus on those models including FIT. Methods The review protocol was published on PROSPERO (CRD42022314710). Searches were conducted from database inception to April 2023 in MEDLINE, EMBASE, Cochrane libraries, SCOPUS and CINAHL. Risk of bias of each study was assessed using The Prediction study Risk Of Bias Assessment Tool. A narrative synthesis based on the guidelines for Synthesis Without Meta-Analysis was performed due to study heterogeneity. Findings We included 62 studies; 23 included FIT (n = 22) or guaiac Faecal Occult Blood Testing (n = 1) combined with one or more other variables. Twenty-one studies were conducted solely in primary care. Generally, prediction models including FIT consistently had good discriminatory ability for CRC/ACP (i.e. AUC >0.8) and performed better than models without FIT although some models without FIT also performed well. However, many studies did not present calibration and internal and external validation were limited. Two studies were rated as low risk of bias; neither model included FIT. Interpretation Risk prediction models, including and not including FIT, show promise for identifying those most at risk of colorectal neoplasia. Substantial limitations in evidence remain, including heterogeneity, high risk of bias, and lack of external validation. 
Further evaluation in studies adhering to gold standard methodology, in appropriate populations, is required before widespread adoption in clinical practice. Funding National Institute for Health and Care Research (NIHR) Health Technology Assessment (HTA) Programme (Project number 133852).


Introduction
Colorectal cancer (CRC) is the third most common cancer and second most common cause of cancer death worldwide, accounting for 1.9 million new cases and 935,000 deaths in 2020. 1 The incidence of CRC is increasing and it is predicted that, by 2040 the number of new CRC cases globally per year will reach 3.2 million. 2 This rise is based on projections of population ageing, population growth and human development. 2,3 Most CRCs develop from pre-cancerous colorectal lesions (adenomas or serrated polyps) progressing, if left in situ, to CRC. 4,5 This natural history means that there is considerable opportunity for cancer prevention if precancerous lesions can be detected early and removed. Whilst population-based screening is effective in reducing incidence and mortality, 6 the overwhelming majority of CRCs are diagnosed after symptoms develop, such as a change in bowel habit, abdominal pain, weight loss or the presence of iron deficiency anaemia. 7,8 Colonoscopy, by allowing direct visualisation of the colonic mucosa, is the preferred investigation for those with suspected CRC. 9 However, patients can experience pain, discomfort or anxiety before, during or after the procedure, and there is a risk (albeit small) of significant complications including haemorrhage and perforation. 10,11 Moreover, demand on endoscopy services is increasing. In the United Kingdom (UK), for example, less than three-quarters of services meet targets for prompt investigation of patients referred for urgent investigation of symptoms. 12,13 Until recently, there was no test to identify those higher-risk symptomatic patients warranting colonoscopy, nor to determine the urgency of investigation. In recent years, driven by growing demand for colonoscopy, researchers and service providers have explored the utility of Faecal Immunochemical Testing (FIT) in symptomatic populations. 
14,15 FIT is simple, non-invasive, can be completed by the patient at home, and is relatively cheap, making it attractive for widespread use. There is evidence to suggest that FIT is powerful in identifying a high-risk sub-population when used in symptomatic patients. 14 As a consequence, guidance has begun to advocate routine use of FIT in patients with features of possible CRC. 16 Alongside this, interest has grown in the development of risk prediction models (statistical models that combine information from two or more variables to predict the likelihood of an outcome), which seek to identify which sub-groups of symptomatic patients (e.g. defined by FIT result and/or a combination of other factors such as age, sex or medical history) are most likely to have precancerous lesions or CRC. 17 The hope is that routine implementation of the algorithms in such models could provide an efficient way for health services to ensure that those patients most at risk undergo colonoscopy in a timely manner, while those at lowest risk avoid unnecessary procedures. 18,19 The aim of this systematic review was to identify, and assess the performance of, models that predict the risk of CRC and/or advanced colorectal polyps (ACP) in symptomatic patients, with a particular focus on those models that include FIT.

Research in context
Evidence before this study
Colonoscopy is an expensive and invasive investigation and health services cannot cope with demand. There is a widespread view that less invasive tools are required to determine which patients require colonoscopy. The use of faecal immunochemical testing (FIT) in the symptomatic setting has significantly increased over recent years and, in some settings, guidance now advocates FIT for use in patients with features of possible colorectal cancer (CRC) to guide referral for urgent investigation. There is growing interest in the use of risk prediction models (statistical models that combine information from two or more variables to predict the likelihood of an outcome) and in whether these models could further improve performance in identifying those requiring investigation. In this review we included studies assessing symptomatic patients, developing/validating a predictive model (with 2 or more factors) for the prediction of CRC and/or advanced colorectal polyp (ACP), identified through searches of the MEDLINE, EMBASE, Cochrane libraries, SCOPUS and CINAHL electronic databases from inception to April 2023.

Added value of this study
This review provides a comprehensive and up-to-date assessment of the ability of risk prediction models (FIT and non-FIT based) to identify colorectal neoplasia. It both updates and extends a past systematic review on this topic (which included papers published to March 2014) and evaluates the evidence in the context of current clinical practice.

Implications of all the available evidence
This review shows that there is considerable potential for the use of risk prediction models, both FIT-based and non-FIT based, in identifying those most at risk of colorectal neoplasia. However, further evaluation of models is required in 'real world' settings before widespread use in clinical practice can be recommended. Based upon this review, this team have undertaken research to develop risk models in the UK population that will be used to guide UK policy.

Study design
The review was registered with the International Prospective Register of Systematic Reviews (PROSPERO) (CRD42022314710) (Supplementary File 1) and has been conducted and reported in line with the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA) statement. 20 The eligibility criteria were developed using the PICOTS (Population, Intervention, Comparator, Outcome, Timing, Setting) framework 21 (Supplementary File 1). We included studies assessing symptomatic patients, developing/validating a predictive model (with 2 or more factors) for the prediction of CRC and/or ACP (see Supplementary File 1 for further detail on definition/terms used for ACP; in brief, we accepted as eligible studies that used a range of different terms). Studies could be randomised trials or observational studies that were conducted in primary, secondary or tertiary care. Studies utilising primary care databases/cancer registries were included if they did not explicitly state that the study population included asymptomatic (screening) individuals. The main outcome was model accuracy (e.g. AUC, sensitivity, specificity) but we also included studies reporting positive predictive values (PPV) for combinations of predictors. In a deviation from protocol, studies reporting PPV that used age or sex in combination with one other factor were not considered predictive models, as these generally involved simply calculating PPV for strata of the study population based on demographics; however, studies reporting PPV that included age and sex and at least one other factor were eligible. Studies were also excluded if they were not in English; assessed screening- or surveillance-only populations or prognostic factors for treatment or outcome of CRC; focused only on genetic variables; or included paediatric populations.
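As an illustration of the accuracy measures named in the eligibility criteria, the following minimal sketch computes sensitivity, specificity and PPV from a 2x2 table. The function name and the counts are hypothetical and are not drawn from any included study.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Return sensitivity, specificity and PPV from 2x2 table counts.

    tp/fn: patients with CRC who test positive/negative.
    fp/tn: patients without CRC who test positive/negative.
    """
    sensitivity = tp / (tp + fn)   # proportion of cancers detected
    specificity = tn / (tn + fp)   # proportion of non-cancers correctly ruled out
    ppv = tp / (tp + fp)           # proportion of positives that are cancers
    return {"sensitivity": sensitivity, "specificity": specificity, "ppv": ppv}


# Hypothetical cohort of 1000 symptomatic patients, 50 of whom have CRC.
example = accuracy_measures(tp=45, fp=255, fn=5, tn=695)
print(example)
```

Unlike sensitivity and specificity, the PPV in this sketch depends on the prevalence of CRC in the cohort, which is one reason PPVs from secondary care populations may not transfer to primary care.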
Searches were conducted from database inception to 4th March 2022, and updated on the 28th April 2023, in MEDLINE, EMBASE, Cochrane libraries, SCOPUS and CINAHL. The search strategy was developed by an information specialist in combination with the review team, utilising a pre-existing prognostic study filter. 22 The complete search strategy can be seen in Supplementary File 2. Additionally, forward and backward citation searching was conducted on all included studies and systematic reviews identified as being relevant.
Study selection was conducted in two stages, first screening citations and then full text of potentially eligible papers, using Rayyan 23 by two reviewers (JSH & RPWK) independently. A third reviewer (LS) arbitrated any conflicts at both the title and abstract and the full text screening stages. A data extraction form based on the CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) was created and utilised. 24 Data were extracted by a single reviewer (JSH or RPWK) and checked for accuracy by a second reviewer (JSH or RPWK). For further information on what data were extracted, please see Supplementary File 1. The Prediction study Risk Of Bias Assessment Tool (PROBAST) was used to assess the risk of bias. 25 One reviewer (JSH or RPWK) assessed risk of bias, with the second reviewer (JSH or RPWK) checking for accuracy.

Synthesis methods & statistical analysis
No statistical analyses were conducted due to heterogeneity of the studies, which meant a meta-analysis was not possible. We include forest plots for studies that report measures of discrimination (i.e. AUC) as a visual representation only. These forest plots do not include a summary of the effect size (weighted or unweighted) as computing these was not deemed statistically appropriate. A narrative synthesis based on the guidelines for Synthesis Without Meta-analysis was therefore completed. 26 For the purpose of synthesis, studies were categorised into FIT and non-FIT containing models. Where models included guaiac faecal occult blood testing (gFOBT) they were grouped with FIT containing models, since both methods detect blood in stool. To aid synthesis, where studies with binary outcomes reported a c-statistic, this has been referred to as AUC.
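The equivalence between the c-statistic and AUC for a binary outcome can be illustrated with a short sketch: the c-statistic is the proportion of all case/non-case pairs in which the case has the higher predicted risk, with ties counted as one half. The function name and the example risks below are hypothetical and serve only to show the calculation.

```python
def c_statistic(scores, outcomes):
    """Concordance (c) statistic for a binary outcome.

    scores: predicted risks; outcomes: 1 for a case (e.g. CRC), 0 otherwise.
    For a binary outcome this equals the area under the ROC curve.
    """
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    non_cases = [s for s, y in zip(scores, outcomes) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for c in cases:
        for n in non_cases:
            if c > n:
                concordant += 1.0   # case ranked above non-case
            elif c == n:
                concordant += 0.5   # tie counts as half
    return concordant / (len(cases) * len(non_cases))


# Hypothetical predicted risks for eight patients, three with CRC.
risk = [0.9, 0.8, 0.35, 0.6, 0.3, 0.2, 0.1, 0.35]
has_crc = [1, 1, 1, 0, 0, 0, 0, 0]
auc = c_statistic(risk, has_crc)
print(round(auc, 2))  # 0.9
```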

Role of the funding source
The funders played no role in the study design, collection, analysis, and interpretation of data, nor the writing of the report or the decision to submit the paper for publication. JSH and RPWK accessed and verified the data. LS, CJR and WH made the decision to submit the manuscript for publication.

Models including FIT
Twenty-three of the studies included FIT (n = 22) or gFOBT (n = 1) combined with one or more other variables (Table 2). Ten studies presented model development only, 30,34,40,43-45,60,65,70,81 four studies presented validations of models, 30,46,49 three studies presented both development and validation, 18,19,63 and six were classed as PPV only studies (i.e. they reported PPVs for FIT in combination with at least one other factor). 32,36,51,67,68,71 The cut-off considered positive for FIT varied between studies (Table 2). One study classed any result above zero μg/g of faeces as positive; 71 another used a cut-off of 0.2 μg/ml. 32 Eleven studies utilised a cut-off between 2 and 25 μg/g of faeces for a positive FIT result. 19,34,40,43-46,60,63,67,86 One study assessed four different analytical machines, with a positive FIT varying between machines (2-50 μg/g of faeces). 68 Three studies of the FAST score (an equation based on FIT, age and sex) used different FIT cut-off values. 18,49,85 One study categorised patients by their FIT result between <10 and >400 μg/g of faeces. 51 The final FIT study assessed a cut-off of 100 ng/ml. 36 All studies including FIT/gFOBT as a variable were rated as high risk of bias. This was generally due to a lack of reporting of adequate calibration statistics (Fig. 2A).
The most commonly reported model (n = 5) utilised FIT, age and sex (FAST) to produce a score that is assessed against a threshold (e.g. >2.12) for the prediction of CRC and, separately, of advanced colorectal neoplasia (ACN; reported below). The FAST score showed good discriminatory ability for CRC when externally validated (AUC = 0.91). 18 Further external validation showed similar results (AUC = 0.87). 49 Three studies performed some form of further validation; these three studies reported similar levels of accuracy (i.e. sensitivity and specificity), but did not report measures of discrimination. 30,46,85 All of these studies were rated high for risk of bias, mainly due to statistical concerns; for example, lack of calibration and selection of variables being based on univariate analysis. The case was similar for all studies that reported models including FIT, with no study being rated as low overall for risk of bias and analysis concerns being the major driver of this (see Fig. 2).
Two further models were also externally validated: COLONOFIT 63 and COLONPREDICT. 19 COLONOFIT, which used the maximum value and number of values above 4 μg Hb/g of FIT across three samples, in addition to age, smoking status and history of previous colonoscopy, showed good discrimination for CRC (validation AUC = 0.86). COLONPREDICT, which uses FIT, demographics, symptoms, and blood tests, also showed good discrimination for CRC (validation AUC = 0.92). COLONPREDICT and the FAST score were reported to be more accurate at predicting CRC than the English National Institute for Health & Care Excellence (NICE) Guideline 12 (NG12) 49 and Clinical Guideline 27 (CG27), the NICE guideline for suspected cancer that preceded NG12. 30 Ayling and colleagues (2021) 46 also provided some validation of the ColonFlag score, an artificial intelligence learning algorithm, which was originally developed in an asymptomatic population. 87-90 They suggested that combining it with FIT could improve sensitivity, but discrimination and calibration were not reported.
Four studies reported on the combination of FIT/gFOBT and other biomarkers. 60,65,70,75 One study obtained a high discrimination value for CRC (AUC = 0.94) by including haemoglobin, platelets, white cell count, Mean Corpuscular Haemoglobin (MCH), Mean Corpuscular Volume (MCV), serum ferritin, and CRP markers, in addition to FIT. 60 One other study reported on the combination of FIT and transferrin, but only reported accuracy measures (PPV = 20.4% for CRC). 32 Another study assessed the combination of FIT, transferrin, lactoferrin and faecal calprotectin (FC), showing good discriminatory ability (AUC = 0.87); however, this was not validated. 65 One study that utilised a mixture of demographics, other biomarkers (colonocyte DNA, MCV, Carcinoembryonic antigen (CEA)), rectal bleeding and gFOBT showed good discrimination for CRC (AUC = 0.88).
FIT combined with faecal calprotectin had high AUC for CRC, using either two samples from both tests (AUC = 0.89) 43 or a single sample from each test (AUC = 0.91), 45 but neither study provided either internal or external validation. Seven studies reported varying results for accuracy when combining FIT with faecal calprotectin alone or with other variables (see Table 2). 36,43-45,67,71,86 Three studies combining FIT and haematological tests such as anaemia/iron deficiency and thrombocytosis reported PPVs for CRC in the range 4%-9%. 51,67,68

FIT models assessing CRC and ACP/ACN or colorectal neoplasia alone
Eight studies reported the discriminatory ability of FIT and other variables to assess CRC combined with other outcomes (e.g. advanced adenoma; AA) or such outcomes alone (e.g. ACN; see Fig. 4). 18,19,34,40,44,45,63,65 The FAST score was originally developed for ACN, and it showed some discriminatory ability (AUC = 0.79); 40 when externally validated this discriminatory ability was maintained (AUC = 0.79). 18 Similar accuracy measures were obtained in these studies when using a cut-off score >4.5 for the outcome of CRC and HRA. 46 Similar results for COLONPREDICT were observed when assessing the outcome of ACN (validation AUC = 0.82). 19 COLONOFIT had a similar discriminatory ability for the outcome of CRC combined with AA (validation AUC = 0.79). 63 One study utilised machine learning methods to develop a model using bacterial biomarkers in addition to FIT for prediction of CRC and AA combined, suggesting good discrimination (AUC = 0.84). 34 FIT, FC, transferrin and lactoferrin showed poor discrimination (AUC = 0.67) for the prediction of adenomas. 65 Assessing for the combined outcome of CRC and high-grade dysplasia, the combination of FIT and faecal calprotectin had high discriminatory ability (AUC = 0.95), 44 but the study included only 430 people and did not report internal or external validation. One further study reported that the combination of FIT with FC had poor discriminatory ability for HRA (AUC = 0.69) and all adenomas (AUC = 0.6). 45 The combination of FIT and FC had varying reported PPVs for outcomes such as ACN and HRA (PPV range = 6.3-22.9%). 36,71,86

Non-FIT models
The remaining 39 studies did not include FIT/gFOBT and assessed models that utilised a mixture of symptoms, haematological tests, medical history, and demographic information (Table 3). 27

Biomarker-based models
Twelve studies reported on models that included one or more tests from routine blood panels or biomarkers. 37,39,48,55,59,62,69,74-76,79,82 The most commonly reported biomarker was carcinoembryonic antigen (CEA; n = 8, three of which had a case-control design). 37,39,48,69,74-76,82 One study assessed the combination of Golgi protein-73 and CEA and reported high discriminatory ability for CRC (AUC = 0.98), but the study included only 90 people and had a case-control design. 75 Two studies reported development of models, with no validation, for combinations of other biomarkers (see Table 3). 79,82 Three further studies developed and externally validated various biomarker combinations, without including sex and age as factors. 48,55,76 All three showed good discriminatory ability for CRC, in Danish (AUC = 0.82 and 0.86) 48,76 and Chinese (AUC = 0.94) 55 patients. Finally, one study that only provided accuracy measures suggested that combining CEA and leucocyte adherence inhibition had a high PPV (54%) for CRC. 37 All of these studies were rated as high risk of bias, mainly due to concerns regarding analysis (e.g. lack of appropriate calibration). Four other studies reported varying accuracy in development models using multiple different biomarkers combined with age and sex but did not externally validate results. 39,69,74,82

Demographics, symptoms, and medical history-based models
The Bristol-Birmingham (BB) equation was developed and validated using the UK THIN primary care database, identifying multiple symptoms and providing one of the highest discrimination values for CRC (AUC = 0.92). 56 However, there were some concerns regarding the identification and applicability of the outcome in the risk of bias assessment. The BB equation was validated within the study and compared against the CAPER (Cancer Prediction in Exeter) score, suggesting it was superior in identifying CRC (validation AUC = 0.79). 56 One study developed and validated a model using change in bowel habit (CIBH) and weight loss, although patients must have presented with rectal bleeding. 29 Only the validation AUC was reported; this suggested good discrimination for CRC (0.88). Another study that utilised a combination of demographics, symptoms and iron deficiency anaemia suggested good discriminatory ability for CRC in development (AUC = 0.87) and validation (AUC = 0.86) cohorts. 57 However, there were concerns regarding the handling of missing data in the analysis, which were coded as absent/missing, meaning the predictive value of symptoms may have been overestimated.
A study in Australian patients developed and validated a model using demographics, lifestyle, and past medical history factors for prediction of CRC and colon and rectal cancers separately. 72 While the model showed moderate discrimination for all three outcomes in development, and the CRC and colon models maintained adequate discrimination after validation (AUC = 0.7 and 0.72, respectively), the discrimination for rectal cancer was less than adequate after validation (AUC = 0.64). Two development studies combined medical history, demographics, symptoms and haematological tests, providing good discriminatory ability for CRC (AUC ≥0.83). 27,28 Another development model utilised age and sex with CIBH (excluding constipation) and the presence of blood in stool, and demonstrated good discriminatory ability for CRC (AUC = 0.97). 64 An issue of applicability was present in this study; rectal bleeding was a pre-requisite for inclusion. 64 Only one of these four studies provided some form of validation (internal). 27

Score-based models
Three papers reported development 41 and validation 38,61 of a weighted numerical score (also known as the Selva score), which combines demographics, history and symptoms, for CRC prediction. The results suggested a good to moderate discriminatory ability (AUC development = 0.86, 41 validation = 0.76) 61 in a secondary care setting. A similar score-based model (incorporating age, indication of bleeding, minimum MCH, minimum ferritin, median WBC, and median platelet count) was reported to have adequate discrimination after validation (AUC = 0.73), but was only available as a conference abstract so detail was limited. 58 Each of these studies was rated as having a high risk of bias, mainly due to reporting of analysis. One study (of the Selva score) also had concerns regarding patient and outcome applicability. 41 The QCancer algorithm for CRC risk was developed and validated using the UK QResearch database. 47,66 This algorithm included demographics, history, and symptoms, with some factors only considered for males and some only for females (Table 3). 47,66 Results suggested good discriminatory ability for CRC (AUC = 0.91 for men and 0.89 for women). Net benefit analysis showed QCancer to be better than an "investigate all" or "investigate none" approach. 47 Additionally, the validation study was rated as low risk of bias, one of only two studies to attain this rating. 47,73 The other study that attained a low risk of bias was similar to the QCancer algorithm, utilising historical variables to assess male and female risk separately; however, only internal validation was performed and the AUC indicated less than adequate discrimination (0.68). 73 Another study developed a score-based algorithm with an array of factors (see Table 3), reporting good discriminatory ability for CRC (AUC = 0.83). 33
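For readers unfamiliar with net benefit analysis, the sketch below shows the standard decision-curve formulation, in which net benefit at a risk threshold p_t is TP/n minus FP/n weighted by p_t/(1 - p_t). It is not the analysis performed in the QCancer validation study; the cohort counts and the 3% threshold are hypothetical values chosen only to illustrate the comparison with "investigate all" and "investigate none".

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit of a referral strategy at a given risk threshold.

    tp, fp: true and false positives among n patients;
    threshold: risk at which investigation is considered worthwhile.
    """
    weight = threshold / (1.0 - threshold)  # harm of a false positive relative to a true positive
    return tp / n - (fp / n) * weight


n_patients = 1000      # hypothetical cohort size
n_cancers = 50         # hypothetical number with CRC
p_t = 0.03             # hypothetical risk threshold

# Hypothetical model: refers 300 patients, 45 of whom have CRC.
nb_model = net_benefit(tp=45, fp=255, n=n_patients, threshold=p_t)
# "Investigate all": every cancer is found, every non-cancer is a false positive.
nb_all = net_benefit(tp=n_cancers, fp=n_patients - n_cancers, n=n_patients, threshold=p_t)
# "Investigate none": no true or false positives, so net benefit is zero.
nb_none = net_benefit(tp=0, fp=0, n=n_patients, threshold=p_t)

print(round(nb_model, 4), round(nb_all, 4), nb_none)
```

In this hypothetical cohort the model has a higher net benefit than either default strategy at the chosen threshold; in practice the comparison is repeated across a range of clinically plausible thresholds.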

Machine learning models using GP records
Four studies applied machine learning techniques to medical notes (e.g. GP records). 50,53,54 Three of these models, which were developed in Dutch patients' records, showed good discrimination for CRC (AUC range = 0.81-0.9). One of these studies utilised the BB equation to aid the development of their most accurate model. 53 Another study explicitly focused on non-metastatic CRC using a case-control study design (Swedish cancer registry) to create a model using multiple symptoms and medical history, reporting good discriminatory ability (validation AUC = 0.83). 78 There were major concerns regarding these studies and how they identified predictors and outcomes. All studies utilised medical records from their respective countries; three from the Netherlands, 50,53,54 and one from Sweden, 78 which could limit their generalisability.

Non-FIT models assessing CRC and ACP/ACN or colorectal neoplasia alone
Eleven studies reported discriminatory ability of varying models for the identification of other outcomes (e.g. AA) alone or in combination with CRC (see Fig. 4). 28,33,39,52,55,59,64,76,79,82 One study assessed the combination of several biomarkers for prediction of AA and reported poor discriminatory ability after validation (Table 3; AUC = 0.65). 76 There were concerns about how the predictors were determined. Four other studies combined demographic information (e.g. age) and/or various biomarkers. 39,76,79,82 Poor discriminatory ability was observed when assessing only AA (AUC = 0.65) 76 and HRAs (AUC = 0.61-0.65). 79,82 Discriminatory ability improved when attempting to predict CRC and HRA (AUC = 0.7-0.76). 39,79,82 However, poor results were observed for the combination of age, sex, hypertension and abdominal pain for the prediction of CRC and adenoma (AUC = 0.65). 52 One study assessed a single biomarker (serum matrix metalloproteinase 9) with age, sex, symptoms, white blood cell count, lifestyle factors and hypertension, and reported adequate discrimination for the prediction of colorectal neoplasia (defined as presence of adenocarcinoma or HRA) (internal validation AUC = 0.73), 59 but did not undertake external validation.
One development study combined medical history, demographics, symptoms and haematological tests, providing adequate discriminatory ability for AA (AUC = 0.7). 28 A similar study, utilising demographics, history (e.g. family, medication), and symptoms, also reported adequate ability for CRC and AA combined (AUC = 0.76). 33 One study, including hypertension and abdominal pain, had poor discrimination for CRC and adenoma prediction (AUC = 0.65). 52 One study reported an adjusted model (AUC = 0.73; cross-validation) and a score-based model (AUC = 0.75; cross-validation) combining demographics, family and medical history, and symptoms for the prediction of CRC and AA. 33 Calibration was lacking. The highest recorded discriminatory ability for a combined outcome (in this case polyps and CRC) was reported by combining age, sex, blood mixed in stool and CIBH (AUC = 0.92). 64 However, there were concerns regarding the participants, outcome identification, analysis, and the applicability of the study.

Discussion
This systematic review identified 62 studies assessing risk prediction models for CRC and/or ACP in symptomatic patients. Of these, 23 assessed models containing tests for blood in stool (22 FIT-based; one gFOBT-based) and 39 assessed non-FIT/gFOBT based models. Twenty-one of the 62 studies were conducted solely in primary care populations. Overall, the evidence suggests prediction models including FIT consistently have good accuracy and discriminatory ability (i.e. AUC > 0.8).
Some models that did not include FIT also had high levels of accuracy and discrimination, but this was not a consistent finding. In addition, eight of the studies assessing non-FIT predictive models had a case-control study design, 62,75-80 which could have overestimated model usefulness. Models, irrespective of whether they included FIT, generally had higher discriminatory ability for CRC than for CRC combined with ACP or ACP alone. For example, the FAST score (FIT, age, and sex) had an AUC of 0.91 for CRC compared to 0.79 for advanced neoplasia in external validation. 18 Of note, only two studies in this review had a low risk of bias; neither of those models included FIT. 47,73 Moreover, several of the studies (n = 15) which reported AUC or similar measures did not report measures of dispersion. The majority of these were non-FIT models (n = 13).
FIT-based models varied in what other variables they included and, by and large, the number of included variables was unrelated to model performance. This, and the heterogeneity in the variables included, means rate of CRC in males than females 1 but the acceptability to patients, health professionals and health service decision-makers of different referral algorithms by sex requires investigation.
An important factor to consider when evaluating the potential utility of a risk prediction model is the setting for potential use. For example, three models that applied machine learning techniques to medical notes were developed in Dutch patients' records and, although the studies showed good discriminatory ability, it is not known if these models are applicable in other healthcare systems, where medical documentation styles may differ. 50,53,54 Such models require further external validation to demonstrate their generalisability to other data outside that used to develop the model. Related to this, few of the studies reported the ethnicity of the individuals in the population(s) in which they developed or validated their models. Therefore, an important caveat on the conclusions of the review is that, while some models perform well (and are validated), it is generally uncertain how they would perform in a population with a very different ethnic make-up.
In this review we also included studies where the outcome measure was PPV for combinations of variables; the rationale for this was our desire to provide a comprehensive overview of the current state of the evidence-base. All of these studies were classed as high risk of bias as PPV (a measure of diagnostic accuracy) is not considered to be an adequate outcome measure for risk prediction models, though it is widely used by clinicians and policy makers. These studies were included because previous UK guidance for investigation of symptomatic patients has been based on PPVs. 92 Studies without FIT presented an array of different symptom combinations and identified some combinations with a high predictive value (e.g. rectal mass and bleeding had a PPV of 17% in one study). 80 Those which included FIT generally combined it with other blood or stool test results (e.g. faecal calprotectin, iron deficiency) and mostly reported high PPVs. Given these findings, and the fact that some of these other test results would either be available routinely as part of primary care blood panels or could be assessed in stool samples, future work assessing calibration and validation of models including FIT, other standard blood/stool test results and, potentially, combinations of symptoms, is warranted.
This review was conducted using a comprehensive search strategy, developed in combination with an information specialist, and utilised rigorous systematic review methodology. By focussing on risk prediction models published up to 2023, it both updates and extends a past systematic review on this topic (which included papers published to March 2014) 93 and the systematic review that informed the 2022 British Society of Gastroenterology/Association of Coloproctology of Great Britain and Ireland guidance on use of FIT in symptomatic patients, which focussed on diagnostic accuracy studies. 94 However, there are some limitations. Firstly, we excluded non-English language studies. While this, in theory, may have introduced some selection bias, research suggests that the chances of this are low. 95 Secondly, we did not perform data extraction in blinded duplicate: this could increase data extraction errors. However, a second reviewer assessed the data extraction for accuracy, minimising such errors. Thirdly, studies utilising primary care databases/cancer registries to identify CRC diagnoses were considered eligible for inclusion unless it was explicitly stated that the study population included asymptomatic or screening patients. The rationale for this was twofold: firstly, the review sought to be comprehensive and excluding these studies would have limited scope and introduced an element of selection bias and, secondly, in primary care, most CRCs are diagnosed through symptomatic services (even in settings with well-organised population-based screening programmes). However, it is possible these studies may have included a small proportion of asymptomatic patients. Fourthly, we included studies with a case-control design; while this was in order to be comprehensive, such studies may be more prone to bias and can overestimate model usefulness. These limitations were reflected in the risk of bias assessment for the relevant studies.
Also considered in the risk of bias assessment was the method of investigation for neoplasia. Method of identification for the outcome of interest (i.e. CRC and/or ACN) varied. While many studies utilised colonoscopy alone (n = 25), some studies utilised varying methods of identification (e.g. sigmoidoscopy; n = 20) or used a database/registry without providing clarification as to how the outcome was identified in those patients (n = 15). While colonoscopy would generally be considered gold-standard, studies with varying methods of identification were included to reflect real-world practice, but it is possible that model performance may have varied if colonoscopy had been used.
This review was undertaken within a programme of work (COLOFIT) intended to inform optimal use of a FIT-based strategy for managing referral of patients with possible CRC symptoms presenting to primary care in NHS England (https://fundingawards.nihr.ac.uk/award/NIHR133852). The review findings suggest several recommendations for future research on risk prediction models for colorectal neoplasia in symptomatic patients; while some of these will be addressed in COLOFIT, they have international applicability. While it may seem obvious, to rigorously evaluate the likely performance of a model, it should be assessed in the population that is the intended target of the algorithm (here, most often, primary care populations); secondary or tertiary care populations are generally enriched for CRC/ACP, making models potentially non-generalisable to primary care populations. Ideally, the ethnic composition of the population should be reported. Adequate validation should be undertaken: at a minimum internal validation, though ideally external. Authors should report all available data, including calibration plots and measures of dispersion for AUC, and consider conducting a net-benefit analysis to assess likely model effectiveness and compare their model to existing pathways. If including FIT, authors should, if possible, report performance for different cut-offs and, if including symptoms, understanding the predictive value of individual symptoms would be valuable. As is evident from this review, many models have now been developed. However, the lack of data on net-benefit in appropriate target populations and external validation is a significant impediment to their wider implementation. Finally, real-world studies of the impact of the use of prediction models on clinical decision-making and patient outcomes are urgently required. 96
The use of FIT in the symptomatic setting has significantly increased over recent years and, in some settings, guidance now advocates FIT for use in patients with features of possible CRC to guide referral for urgent investigation. This review shows that there is considerable promise for the use of risk prediction models, both FIT-based and non-FIT based, in identifying those most at risk of colorectal neoplasia. However, there are significant limitations in the evidence base, notably around the lack of net-benefit analysis and external validation, and the real-world impact of such algorithms is not yet understood.

Contributors
James S Hampton (JSH) and Ryan PW Kenny (RPWK) co-authored the first draft of the review protocol, contributed to development of the search strategy, undertook the screening and selection of articles, extracted data, synthesised results and co-authored the first draft of the manuscript.
Claire Eastaugh (CE) and Catherine Richmond (CR) provided expertise in developing and performing the searches and approved final manuscript for submission.
Colin J Rees (CJR) had the idea for the review, secured funding, edited and approved review protocol, contributed to development of the search strategy, edited and approved final manuscript for submission.
William Hamilton (WH) had the idea for the review, secured funding, edited and approved review protocol, contributed to development of the search strategy, edited and approved final manuscript for submission.
Linda Sharp (LS) had the idea for the review, secured funding, edited and approved review protocol, contributed to development of the search strategy, arbitrated any conflicts in the study selection process, edited and approved final manuscript for submission.
JSH and RPWK accessed and verified the data. LS, CJR and WH made the decision to submit the manuscript for publication.

Data sharing statement
All of the relevant data is contained within the manuscript and Supplementary material.